CRFsuite Tutorial
Task description
In this example, NP stands for a noun phrase, VP for a verb phrase, and PP for a prepositional phrase.
This is a sequential labeling task (sequence labeling).
The goal of this tutorial is to build a model that predicts chunk labels for a given sentence (sequence of tokens) by using CRFsuite.
Training and testing data
Example: London JJ B-NP
The data consists of a set of sentences (sequences) each of which contains a series of words (e.g., 'London', 'shares'), part-of-speech tags (e.g., 'JJ', 'NNS'), and chunk labels (e.g., 'B-NP', 'I-NP') separated by space characters.
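The layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name read_sentences is hypothetical, and it assumes that a blank line marks the end of each sentence, which the notes above do not state.

```python
def read_sentences(lines, separator=" "):
    """Yield each sentence as a list of (word, pos, label) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Assumption: a blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos, label = line.split(separator)
        sentence.append((word, pos, label))
    if sentence:
        # Emit the last sentence if the input does not end with a blank line.
        yield sentence


sample = [
    "London JJ B-NP\n",
    "shares NNS I-NP\n",
    "\n",
    "London JJ B-NP\n",
]
sentences = list(read_sentences(sample))
print(sentences[0])
```

For the real files, the lines could come from gzip.open("train.txt.gz", "rt") in the standard library instead of the in-memory sample.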
The scripts needed for this tutorial are included under the example directory in the CRFsuite distribution.
(Apparently less can show the contents of a txt.gz file.)
In this tutorial, we would like to construct a CRF model that assigns a sequence of chunk labels, given a sequence of words and part-of-speech codes.
Feature (attribute) generation
Convert {train,test}.txt.gz into {train,test}.crfsuite.gz
The features are tab-separated
In general, this is the most important process for machine-learning approaches because feature design greatly affects the labeling accuracy.
(This predates deep learning, so feature engineering carries a lot of weight.)
In this tutorial, we extract 19 kinds of attributes from a word at position t (in offsets from the beginning of a sequence):
Words within two positions before and after (w[t-2], w[t-1], w[t], w[t+1], w[t+2])
Word bigrams (w[t-1]|w[t], w[t]|w[t+1])
POS tags of the words within two positions before and after (pos[t-2], pos[t-1], pos[t], pos[t+1], pos[t+2])
POS n-grams
POS bigrams (pos[t-2]|pos[t-1], pos[t-1]|pos[t], pos[t]|pos[t+1], pos[t+1]|pos[t+2])
POS trigrams (pos[t-2]|pos[t-1]|pos[t], pos[t-1]|pos[t]|pos[t+1], pos[t]|pos[t+1]|pos[t+2])
(Are these counted up over the entire training data (= the corpus)?)
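To make the 19 attribute kinds concrete, here is a rough sketch that generates them for one position t. The names ngram, WINDOWS, and extract are hypothetical and are not taken from the CRFsuite scripts. The sketch writes offsets relative to t (for example pos[0]|pos[1]|pos[2]) and assumes that attributes reaching outside the sentence are simply dropped. The example sentence is made up for illustration.

```python
def ngram(seq, name, t, start, length):
    """Return 'name[o1]|name[o2]=v1|v2', or None if the window leaves the sequence."""
    offsets = range(start, start + length)
    if any(t + o < 0 or t + o >= len(seq) for o in offsets):
        return None  # assumption: out-of-range attributes are skipped
    names = "|".join("%s[%d]" % (name, o) for o in offsets)
    values = "|".join(seq[t + o] for o in offsets)
    return "%s=%s" % (names, values)


# (field, first offset, n-gram length) for each of the 19 attribute kinds.
WINDOWS = (
    [("w", o, 1) for o in range(-2, 3)]      # 5 word unigrams
    + [("w", o, 2) for o in (-1, 0)]         # 2 word bigrams
    + [("pos", o, 1) for o in range(-2, 3)]  # 5 POS unigrams
    + [("pos", o, 2) for o in range(-2, 2)]  # 4 POS bigrams
    + [("pos", o, 3) for o in range(-2, 1)]  # 3 POS trigrams
)


def extract(words, tags, t):
    """List the attributes available at position t."""
    fields = {"w": words, "pos": tags}
    attrs = (ngram(fields[f], f, t, start, n) for f, start, n in WINDOWS)
    return [a for a in attrs if a is not None]


# Made-up example sentence, not taken from the corpus.
words = ["The", "big", "dog", "barks", "loudly"]
tags = ["DT", "JJ", "NN", "VBZ", "RB"]
middle = extract(words, tags, 2)
first = extract(words, tags, 0)
print(len(middle), len(first))
```

In the middle of the sentence all 19 attributes are produced, while at the first token only those that look at the current and following positions remain.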
CRFsuite will learn associations between these attributes (e.g., "pos[0]|pos[1]|pos[2]=DT|JJ|NN") and labels (e.g., "B-NP") to predict a label sequence for a given text.
The "name=value" convention merely makes attribute names easier to interpret.
CRFsuite accepts any string as an attribute name as long as the string does not contain a colon character.
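As a sketch of what the colon restriction means in practice, the snippet below writes one item as a label followed by tab-separated attributes and replaces colons before output. The helper names escape and format_item and the placeholder __COLON__ are assumptions chosen for illustration. The colon is reserved, as far as I understand the data format, because it separates an attribute name from an optional scaling value.

```python
def escape(attr):
    """Replace ':' so the attribute name is safe to write out.

    Assumption: the placeholder string is arbitrary; any colon-free token works.
    """
    return attr.replace(":", "__COLON__")


def format_item(label, attrs):
    """Build one line: the label first, then the attributes, separated by tabs."""
    return "\t".join([label] + [escape(a) for a in attrs])


line = format_item("B-NP", ["w[0]=London", "w[0]=10:30"])
print(line)
```

The second attribute value, 10:30, is a made-up token that shows why escaping is needed: without it the text after the colon would not be treated as part of the name.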
Training
crfsuite learn
Produces the model file specified with -m
You can also train a CRF model while watching its performance (accuracy, precision, recall, f1 score) on the test data.
crfsuite learn -e2
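For reference, the four measures mentioned above can be computed from gold and predicted labels as in the sketch below. This is a plain per-item calculation written for illustration, with a hypothetical function name and made-up label sequences. It is not a reproduction of how CRFsuite computes or prints its report.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return (accuracy, {label: (precision, recall, f1)}) over aligned label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong here
            fn[g] += 1  # gold label g was missed here
    scores = {}
    for label in set(gold) | set(predicted):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    return accuracy, scores


# Made-up label sequences for illustration.
gold = ["B-NP", "I-NP", "B-VP", "B-NP"]
pred = ["B-NP", "B-NP", "B-VP", "B-NP"]
accuracy, scores = per_label_scores(gold, pred)
print(accuracy, scores["B-NP"])
```

Here one I-NP item is mislabeled as B-NP, so B-NP keeps full recall but loses precision, and I-NP scores zero on all three measures.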
Tagging
crfsuite tag
Dumping the model file
crfsuite dump
Notes on writing attribute extractors
The common stuff (attribute generation from templates, data I/O, etc.) is implemented in other modules, crfutils.py and template.py.
chunking.py only defines three variables
separator
separator character(s) of the input data
A single space in the example
fields
field name(s) (ordered from left to right) of the input data, separated by a space character
Three columns: w, pos, y
London(=w) JJ(=pos) B-NP(=y)
templates
attribute (feature) templates written as a Python tuple/list object
a tuple/list of (name, offset) pairs, in which name represents a field name, and offset represents an offset to the current position.
(('w', -2), ), = w[t-2]
(('w', -1), ('w', 0)),
the bigram starting at the previous token
(represents a word bigram)
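The three variables and the way a template turns into an attribute can be sketched as follows. The function apply_templates is a hypothetical stand-in and does not reproduce the code in crfutils.py or template.py. It assumes that a template is skipped when any of its offsets falls outside the sentence. The third input row is an illustrative addition.

```python
separator = " "
fields = "w pos y"
templates = (
    (("w", -2),),           # the word two positions back
    (("w", -1), ("w", 0)),  # the bigram starting at the previous token
)


def apply_templates(items, templates, t):
    """Build 'name=value' attributes at position t from (name, offset) templates."""
    attrs = []
    for template in templates:
        positions = [t + offset for _, offset in template]
        if any(p < 0 or p >= len(items) for p in positions):
            continue  # assumption: templates reaching outside the sentence are skipped
        name = "|".join("%s[%d]" % (field, offset) for field, offset in template)
        value = "|".join(items[t + offset][field] for field, offset in template)
        attrs.append("%s=%s" % (name, value))
    return attrs


names = fields.split(" ")
rows = ["London JJ B-NP", "shares NNS I-NP", "closed VBD B-VP"]
# Each item maps a field name to its value, e.g. {'w': 'London', 'pos': 'JJ', 'y': 'B-NP'}.
items = [dict(zip(names, row.split(separator))) for row in rows]
print(apply_templates(items, templates, 2))
```

Keeping the templates as data means a new attribute only needs a new tuple, with no change to the extraction code.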
Feature extractors for other tasks